apache kafka
Real-Time Health Analytics Using Ontology-Driven Complex Event Processing and LLM Reasoning: A Tuberculosis Case Study
Chandra, Ritesh, Agarwal, Sonali, Singh, Navjot
Timely detection of critical health conditions remains a major challenge in public health analytics, especially in Big Data environments characterized by high volume, rapid velocity, and diverse variety of clinical data. This study presents an ontology-enabled real-time analytics framework that integrates Complex Event Processing (CEP) and Large Language Models (LLMs) to enable intelligent health event detection and semantic reasoning over heterogeneous, high-velocity health data streams. The architecture leverages the Basic Formal Ontology (BFO) and Semantic Web Rule Language (SWRL) to model diagnostic rules and domain knowledge. Patient data is ingested and processed using Apache Kafka and Spark Streaming, where CEP engines detect clinically significant event patterns. LLMs support adaptive reasoning, event interpretation, and ontology refinement. Clinical information is semantically structured as Resource Description Framework (RDF) triples in Graph DB, enabling SPARQL-based querying and knowledge-driven decision support. The framework is evaluated using a dataset of 1,000 Tuberculosis (TB) patients as a use case, demonstrating low-latency event detection, scalable reasoning, and high model performance (in terms of precision, recall, and F1-score). These results validate the system's potential for generalizable, real-time health analytics in complex Big Data scenarios.
Practical Performance of a Distributed Processing Framework for Machine-Learning-based NIDS
Kajiura, Maho, Nakamura, Junya
Network Intrusion Detection Systems (NIDSs) detect intrusion attacks in network traffic. In particular, machine-learning-based NIDSs have attracted attention because of their high detection rates of unknown attacks. A distributed processing framework for machine-learning-based NIDSs employing a scalable distributed stream processing system has been proposed in the literature. However, its performance, when machine-learning-based classifiers are implemented has not been comprehensively evaluated. In this study, we implement five representative classifiers (Decision Tree, Random Forest, Naive Bayes, SVM, and kNN) based on this framework and evaluate their throughput and latency. By conducting the experimental measurements, we investigate the difference in the processing performance among these classifiers and the bottlenecks in the processing performance of the framework.
Machine Learning Tutorial with Python, Jupyter, KSQL and TensorFlow
When Michelangelo started, the most urgent and highest impact use cases were some very high scale problems, which led us to build around Apache Spark (for large-scale data processing and model training) and Java (for low latency, high throughput online serving). This structure worked well for production training and deployment of many models but left a lot to be desired in terms of overhead, flexibility, and ease of use, especially during early prototyping and experimentation [where Notebooks and Python shine]. Uber expanded Michelangelo "to serve any kind of Python model from any source to support other Machine Learning and Deep Learning frameworks like PyTorch and TensorFlow [instead of just using Spark for everything]." So why did Uber (and many other tech companies) build its own platform and framework-independent machine learning infrastructure? The posts How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and Using Apache Kafka to Drive Cutting-Edge Machine Learning describe the benefits of leveraging the Apache Kafka ecosystem as a central, scalable, and mission-critical nervous system. It allows real-time data ingestion, processing, model deployment, and monitoring in a reliable and scalable way. This post focuses on how the Kafka ecosystem can help solve the impedance mismatch between data scientists, data engineers, and production engineers. By leveraging it to build your own scalable machine learning infrastructure and also make your data scientists happy, you can solve the same problems for which Uber built its own ML platform, Michelangelo.
Apache Kafka for Conversational AI, NLP and Chatbot
Natural Language Processing (NLP) helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases. Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference.
ESTemd: A Distributed Processing Framework for Environmental Monitoring based on Apache Kafka Streaming Engine
Distributed networks and real-time systems are becoming the most important components for the new computer age, the Internet of Things (IoT), with huge data streams or data sets generated from sensors and data generated from existing legacy systems. The data generated offers the ability to measure, infer and understand environmental indicators, from delicate ecologies and natural resources to urban environments. This can be achieved through the analysis of the heterogeneous data sources (structured and unstructured). In this paper, we propose a distributed framework Event STream Processing Engine for Environmental Monitoring Domain (ESTemd) for the application of stream processing on heterogeneous environmental data. Our work in this area demonstrates the useful role big data techniques can play in an environmental decision support system, early warning and forecasting systems. The proposed framework addresses the challenges of data heterogeneity from heterogeneous systems and real time processing of huge environmental datasets through a publish/subscribe method via a unified data pipeline with the application of Apache Kafka for real time analytics.
Building AI Models for High-Frequency Streaming Data โ Part Two - KDnuggets
AI continues making headlines in the data science community, and predictive models are front and center in engineering applications such as autonomous driving and equipment monitoring. Introducing AI models into engineering systems can be challenging, however, especially when predictions must be reported in near real-time on data from multiple sensors. Many data scientists have implemented machine or deep learning algorithms on static data or in batch, but what considerations must you make when building models for a streaming environment? In this post, we will discuss these considerations. If streaming movies or music comes to mind, you've got the right idea! Data is incoming continuously, but instead of simply watching, actions must be taken based on the information.
The Fintech Future: Accelerating the AI & ML Journey
Artificial intelligence (AI) has assumed a growing influence within financial services in recent years, affecting areas such as credit decisions, risk management, fraud detection, and stress testing. And for many fintechs, it has been baked into the process from the outset, to the extent that usage of AI in the fintech market registered $6 billion in 2019 and is expected to reach $22 billion by 2025. Economic fallout from the pandemic, however, has accelerated the timetable for financial services firms to become mass adopters of AI and harness its predictive powers sooner rather than later. For digitally native fintechs, many of which have already embraced AI and its capabilities, this offers the opportunity to invest further in the technology and capitalise on the tools available to accelerate their journeys. Fintechs across the world are dealing with the effects of Covid-19 and face an uphill challenge in containing the impact of it on the financial system and broader economy. With rising unemployment and stagnated economies, individuals and companies are struggling with debt, while the world in general is awash in credit risk.
Event Stream Processing: How Banks Can Overcome SQL and NoSQL Related Obstacles with Apache Kafka
While getting to grips with open banking regulation, skyrocketing transaction volumes and expanding customer expectations, banks have been rolling out major transformations of data infrastructure and partnering with Silicon Valley's most innovative tech companies to rebuild the banking business around a central nervous system. This can also be labelled as event stream processing (ESP), which connects everything happening within the business - including applications and data systems - in real-time. ESP allows banks to respond to a series of data points โ events - that are derived from a system that consistently creates data โ the stream โ to then leverage this data through aggregation, analytics, transformations, enrichment and ingestion. Further, ESP is instrumental where batch processing falls short and when action needs to be taken in real-time, rather than on static data or data at rest. However, handling a flow of continuously created data requires a special set of technologies.
Machine Learning Tutorial with Python, Jupyter, KSQL and TensorFlow
When Michelangelo started, the most urgent and highest impact use cases were some very high scale problems, which led us to build around Apache Spark (for large-scale data processing and model training) and Java (for low latency, high throughput online serving). This structure worked well for production training and deployment of many models but left a lot to be desired in terms of overhead, flexibility, and ease of use, especially during early prototyping and experimentation [where Notebooks and Python shine]. Uber expanded Michelangelo "to serve any kind of Python model from any source to support other Machine Learning and Deep Learning frameworks like PyTorch and TensorFlow [instead of just using Spark for everything]." So why did Uber (and many other tech companies) build its own platform and framework-independent machine learning infrastructure? The posts How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and Using Apache Kafka to Drive Cutting-Edge Machine Learning describe the benefits of leveraging the Apache Kafka ecosystem as a central, scalable, and mission-critical nervous system. It allows real-time data ingestion, processing, model deployment, and monitoring in a reliable and scalable way. This post focuses on how the Kafka ecosystem can help solve the impedance mismatch between data scientists, data engineers, and production engineers. By leveraging it to build your own scalable machine learning infrastructure and also make your data scientists happy, you can solve the same problems for which Uber built its own ML platform, Michelangelo.
Streaming Machine Learning with Tiered Storage
Both approaches have their pros and cons. The blog post Machine Learning and Real-Time Analytics in Apache Kafka Applications and the Kafka Summit presentation Event-Driven Model Serving: Stream Processing vs. RPC with Kafka and TensorFlow discuss this in detail. There are more and more applications where the analytic model is directly embedded into the event streaming application, making it robust, decoupled, and optimized for performance and latency. The model can be loaded into the application when starting it up (e.g., using the TensorFlow Java API). Model management (including versioning) depends on your build pipeline and DevOps strategy. For example, new models can be embedded into a new Kubernetes pod which simply replaces the old pod. Another commonly used option is to send newly trained models (or just the updated weights or hyperparameters) as a Kafka message to a Kafka topic.